cropCSM: designing safe and potent herbicides with graph-based signatures

Abstract Herbicides have revolutionised weed management, increased crop yields and improved profitability allowing for an increase in worldwide food security. Their widespread use, however, has also led to a rise in resistance and concerns about their environmental impact. Despite the need for potent and safe herbicidal molecules, no herbicide with a new mode of action has reached the market in 30 years. Although development of computational approaches has proven invaluable to guide rational drug discovery pipelines, leading to higher hit rates and lower attrition due to poor toxicity, little has been done in contrast for herbicide design. To fill this gap, we have developed cropCSM, a computational platform to help identify new, potent, nontoxic and environmentally safe herbicides. By using a knowledge-based approach, we identified physicochemical properties and substructures enriched in safe herbicides. By representing the small molecules as a graph, we leveraged these insights to guide the development of predictive models trained and tested on the largest collected data set of molecules with experimentally characterised herbicidal profiles to date (over 4500 compounds). In addition, we developed six new environmental and human toxicity predictors, spanning five different species to assist in molecule prioritisation. cropCSM was able to correctly identify 97% of herbicides currently available commercially, while predicting toxicity profiles with accuracies of up to 92%. We believe cropCSM will be an essential tool for the enrichment of screening libraries and to guide the development of potent and safe herbicides. We have made the method freely available through a user-friendly webserver at http://biosig.unimelb.edu.au/crop_csm.


Introduction
Herbicides are widely used chemical agents capable of killing or inhibiting growth of unwanted plants, including weeds and grass types, that might compromise crop yields. Over the years, their adoption has revolutionised weed management, increased crop yields and improved profitability allowing for an increase in worldwide food security.
Their widespread use, however, has also led to a rise in resistance [1]. Without appropriate measures to bring herbicides with new modes of action to the market, combined with a concerted global effort at product stewardship and regulation, herbicide resistance could reduce world food production in the coming years by 20-40%, leading to a global food security crisis [2]. Although agricultural practices have improved, decades of success by glyphosate and spiralling costs have stymied herbicide development. Designing potent herbicides is particularly challenging considering toxicity concerns, and currently available and widely used weed control compounds have shown to be potentially harmful for the environment, livestock and human health. Exposure to agrochemicals has been, for instance, shown to be one of the potential drivers to the decline of pollinators worldwide [3]. Despite the need for novel potent and safe herbicidal molecules, no herbicide with a new mode of action has reached the market in 30 years [4].
Developing herbicides, much like pharmaceuticals, involves a careful balance between efficacy and safety. In the pharmaceutical industry, drug development pipelines have tackled these challenges by modelling and optimising these important parameters early in the development process, which has been assisted by the implementation of computational pipelines. This has led, in general, to increased hit rates and decreased attrition due to poor toxicity profiles and, in the process, reduced development time, costs and animal testing [5][6][7][8][9][10]. Although many computer-aided approaches have proven invaluable for drug development, in contrast little has been done to aid the development of safe and potent agrochemicals. A relevant recent development has been the work of Oršolić et al. to establish structure-activity relationships linking herbicides with modes of action and weed selectivity [11].
To fill this gap, here we describe the design of the first computational method to support and guide rational development of safe and potent herbicides, cropCSM, using the largest curated small molecule database of experimentally characterised herbicidal activity to date. Our method uses the concept of graph-based signatures, represented as a cutoff scanning matrix (CSM) (Figure 1), to model small molecule physicochemical properties coupled with machine learning to train and validate predictive tools capable of accurately identifying molecules with potent herbicidal activity, as well as characterise their environmental and human toxicity profiles.

Materials and methods
cropCSM development was divided into four main steps, including (i) data acquisition, which involved obtaining data relating small molecules to their herbicidal activity as well as human and environmental toxicity profiles; (ii) feature engineering, which involved calculating molecular properties and performing graph-based modelling to extracting relevant information used as evidence in the next steps to (iii) identify what makes up potent and safe herbicides, identifying enriched substructures and qualitatively assessing molecular properties and to (iv) train and test predictive model using supervised learning algorithms.

Data for herbicidal activity
A dataset of 4513 experimentally characterized, structurally diverse small molecules and their herbicidal activity profiles was obtained from Sukhoverkov et al. [12]. These were labelled either as active (997 molecules) or inactive (3516 molecules). They had an average molecular weight of 380 Da and logP of 2.4 ( Figure S1). A database of 356 commercial herbicides was also used to evaluate cropCSM [12].

Data for environmental and human toxicity
We have developed new predictors based on six environmental and human toxicity data sets with experimentally characterised molecules. Environmental toxicity data sets included (a) honey bee (Apis mellifera) toxicity, which was composed of 247 toxic and 353 atoxic molecules [13]; (b) avian toxicity, composed of 461 small molecules and their effects on mallard duck (66 toxic and 395 atoxic) [14] and (c) flathead minnow toxicity, with lethal concentration values (LC50) for a diverse set of 554 molecules [15]. Human toxicity data sets included (1) AMES toxicity, with compounds labelled based on their carcinogenic potential (4632 carcinogenic and 3470 not Performance of herbicide and environmental-toxicity predictors is shown in (B). Our herbicide predictor was able to accurately identify active compounds with AUC > 0.85 on cross-validation and blind test. Three environmental toxicity models have been developed and were capable of successfully measuring minnow toxicity (as a regression task, centre graph) as well as identifying potentially harmful compounds for Bees and Mallard (right-hand side graph). Sensitivity refers to the true positive rate, whereas specificity to the true negative rate. carcinogenic) [16]; (2) oral acute toxicity in rats, denoted as lethal dose (LD50) values for 10 145 compounds [17] and (3) oral chronic toxicity in rats values for 567 compounds [18].

Graph-based signatures and feature engineering
Graph modelling is an invaluable tool to model biological entities, including small molecules [19][20][21][22]. Over the years we have proposed and developed the concept of graph-based signatures (based on CSM concept [23]) to represent physicochemical and geometrical properties of a range of macromolecules [24][25][26][27][28][29] and their interactions [30][31][32][33][34][35][36][37]. These have also been successfully adapted to represent small molecules pharmacokinetics, toxicity and bioactivity [38][39][40][41]. Figure 1 depicts the main steps involved in feature engineering with graph-based signatures. Small molecules are modelled as unweighted, undirected graphs where nodes represent atoms and edges represent covalent bonds. Via pharmacophore modelling [38], atoms/nodes are labelled based on their physicochemical properties. This atomic graph representation of small molecules accounts for both their shape and composition, from which information is extracted as distance patterns. To do so, all-pairs shortest paths distances are calculated, and molecules are then represented simply as cumulative distribution functions of atom distances. These distances are further categorised based on their respective physicochemical properties (pharmacophores) and converted as a feature vector used as evidence to train and test predictive methods. Complementary physicochemical properties are calculated and included using the RDKit cheminformatics library [42] and included in the feature vector.

Model selection and validation
Different supervised learning algorithms available on the scikit-learn Python library [43] were assessed with best performing models (Random Forest, n_estimators = 300 for herbicide activity). Information on learning algorithms and hyperparameters for remaining models is available as Table S1 and selected based on accuracy, Matthew's correlation coefficient (MCC) and the area under the Receiver operating characteristic curve (ROC curve)  (AUC) for classification tasks and Pearson's correlation [44] and root mean squared error (RMSE) for regression tasks. Hyperparameter tuning was performed with a grid search method implemented in scikit-learn; however, no significant improvement in performance was observed. Performance was assessed under stratified 10-fold crossvalidation (obtained using scikit-learn Python library) [29,30,45] as well as using nonredundant blind tests (blind tests were composed of held out sets accounting for 10% of molecules selected at random and used as external validation). A feature selection step was used to reduce dimensionality and improve performance via a stepwise forward greedy selection approach [24,[28][29][30][46][47][48].

Substructure mining
To identify what makes up a herbicide, molecular substructure mining was employed to identify substructures that were enriched in the herbicidal class and depleted in the nonherbicidal class. For this purpose, the molecular substructure miner (MoSS) [49] tool was used, and different minimum support/frequency cutoffs tested. We considered 5-and 10-fold differences between support of positive and negative classes (1%, 5%, 1-10% and 2-10%).

Data visualization
t-distributed stochastic neighbor embedding (t-SNE) plots were generated using the R package 'tsne' with default parameters.

What makes an herbicide? Correlating molecular properties with herbicidal activity
Using experimental information on the herbicidal activity for a collected data set of over 4500 small molecule compounds (22% with herbicidal activity), we investigated what physicochemical properties of the compounds translate to herbicidal activity. Herbicidal molecules were enriched in saturated carbon chains and benzene substructures, compared with the inactive molecules (Figure 2A). The majority (90%) of the active compounds tended to be < 517 Da, have up to 9 acceptors and 4 donors, fewer than 9 rotatable bonds and a logP between −1.7 and 6.1 ( Figure S1) (95% <700 Da, 11 rotatable bonds, 11 acceptors, 6 donors and logP between −3.0 to 6.1). This is similar, although slightly more lenient, than the widely used Lipinski's Rule of Five for orally bioavailable drugs. Interestingly, but consistently, there was no significant distinction in physicochemical properties between herbicides and approved drugs, as illustrated in the t-SNE plot ( Figure S2). Compared with all U.S. food and drug administration (FDA)approved drugs, however, herbicides were enriched in substructures involving chlorine.
Herbicides have been previously compared with pharmaceuticals [10] with our analysis being consistent with previous results across smaller datasets [50][51][52][53][54], which have shown that physicochemical properties of herbicides are similar to orally delivered drugs, although the Performance on cross-validation of human-toxicity predictors. Three models have been developed and were capable of accurately measuring acute and chronic rat toxicity (as regression tasks, left and centre graphs) as well as identifying potentially carcinogenic compounds (Ames toxicity as a classification task, right-hand side graph). For regression models, two Pearson's correlation coefficients are shown: one obtained for the whole data set (in black) and one on 90% of the data, after 10% of outliers or poorly predicted points are removed (in red) for analysis purposes to assess the potential effects of outliers on performance (depicted as red dots).  former tend to be smaller, with fewer proton donors, lower partition coefficient [53,54].
These insights were used as a platform to build a supervised machine learning predictive model to identify herbicidal compounds, where the small molecule structure was represented as a graph-based signature, termed CSM, in which the atoms are represented as nodes, and covalent interactions between them as edges [38,45]. Under stratified 10-fold cross-validation, we were able to correctly identify 82% of the active molecules with an overall accuracy of 87%, area under the ROC curve (AUC) of 0.85 and MCC of 0.60 ( Figure 2B and Table 1). For compounds over 500 Da, cropCSM achieved an AUC of 0.81, illustrating the robust performance of the approach on larger compounds. A precision-recall curve (PR curve) was also calculated ( Figure S3) to investigate the compromises between type I and type II errors. cropCSM achieved a PR AUC of 0.93, demonstrating a well-tuned model, with good performance and balance between type error I and II. When the model was evaluated against a blind test set of 106 active and 345 inactive molecules, we achieved comparable performance (87% accuracy, AUC of 0.87 and MCC of 0.59). This provided confidence that the approach can be generalized and used with unknown sets of putative herbicidal molecules active against a target of interest.
To further validate the cropCSM models and demonstrate their real-world applicability, they were applied to a set of 356 commercial herbicides [12]. Over 97% were correctly identified as herbicidal (Figure 3). Of those that were not, they included the natural fatty acid oleic acid and fragment-like molecules such as dazomet and pentachlorophenol.

Predicting environmental toxicity
Agrochemicals have been linked to a range of unwanted negative effects on both health and the environment, including glyphosate-free herbicides such as benzalkonium chloride [55]. To help identify safe herbicides, complementary models were developed to capture the impact of a small molecule on honey bee (A. mellifera), mallard (Anas platyrhynchos) and f lathead minnow (Pimephales promelas) toxicity, in addition to measures of human health, including AMES toxicity, rat LD50 and oral chronic toxicity. Although assessing molecular substructures enriched in toxic compounds (based on these datasets), (Figure S4), we identified a prevalence of complex ring structures. Of note, structures rich in chlorine, while enriched in herbicides, were also enriched in compounds that were toxic for mallard and minnow, highlighting a potential inherent difficulty in optimising potency and safety when designing herbicides. Tables 2 and 3 depict the performance of cropCSM models for identifying toxic molecules as regression and classification tasks, respectively, assessed during crossvalidation. Overall, we were also able to identify toxic molecules as classification and regression tasks with accuracies of up to 92% and Pearson's correlations of up to 0.86, outperforming previous predictive approaches [13,14,38,56], including pkCSM, admetSAR and the works from Wang et al. and Zhang et al. Regarding environmental toxicity models, the minnow toxicity (LC50) model achieved a Pearson's correlation of 0.86 (RMSE of 0.74), a performance that increases to 0.93 on 90% of the data (after removing 10% outliers -these are only removed for analysis purposes to assess the potential effects of outliers on performance), whereas honeybee and avian toxicity classifiers achieved AUCs of 0.81 and 0.83 ( Figure 2B). For human health toxicity predictors, cropCSM achieve Pearson's correlation coefficients of 0.79 and 0.75 for oral rat acute and chronic toxicity, respectively, and an AUC of 0.94 while predicting AMES compounds (Figure 4).
Using these validated methods, we set out to assess the toxicity profiles of currently commercially available herbicides. The list of 356 commercial herbicides was submitted to the predictors as an independent test. Figure 3 depicts a large proportion of compounds predicted to be environmental or human toxic, with molecules more frequently predicted as either AMES toxic (17%, 60 out of 356) or minnow toxic (20%, 70 out of 356). Interestingly, widely used herbicides glyphosate and glufosinate were predicted as nontoxic to humans. This is interesting, as while glyphosates have been proposed to be carcinogenic, there is growing support that it is the inert products that are responsible for damage to human beings and the environment [57], which would be consistent with the predictions from cropCSM. When grouping commercial herbicides based on their mode of action, a significant proportion of minnow toxic compounds belonged to the class of inhibitors of acetyl CoA carboxylase (14/24, 58% versus 20% overall), AMES toxic compounds were enriched in the class of Inhibitors of cell division (10/28, 36% versus 17% overall) and compounds involved in Rat Chronic toxicity were more frequent in the class of inhibitors of lipid synthesis (7/19, 37% versus 14% overall). These results add credence to the tool to rapidly identify potentially hazardous molecules early in the development process, which has the potential to significantly reduce costs and failure rates.

The cropCSM web server
The backend of the cropCSM web server was developed using the Python Flask framework version 0.12.3 and the front end using Bootstrap framework version 3.3.7. The system is hosted by a Linux server running Apache. Figure 5A depicts the cropCSM web interface. Users can submit molecules to the server either individually as SMILES strings or in batch as SMILES files ( Figure 5B). Prediction results are displayed in tabular format ( Figure 5C), which are also made available to download as a comma-separated file.

Conclusions
Here we described cropCSM, the first free and easy-touse in silico platform dedicated to assist the development of herbicides that are potent, but also nontoxic and environmentally safe. cropCSM is optimally designed to evaluate the herbicide potential and toxicity of small molecular compounds up to 2 kDa, consistent with the training and validation data sets. We employed graphbased signatures to model small molecule physicochemistry of the largest collected data set of molecules with experimentally characterised herbicidal profiles to date and demonstrated their efficacy. We anticipate future iterations of cropCSM that will draw upon larger datasets and as a result will have a higher predictor capability, allowing for a greater increase in accuracy and correlation. We believe cropCSM will be an invaluable tool to assist rational design of new potent herbicidal molecules that are also safe for humans and the environment. The tool has been made available as a user-friendly web interface at https://biosig.unimelb.edu.au/crop_csm.

Key Points
• Although very little has been done to assist the designing of potent and safe herbicides computationally, cropCSM fills this gap. • cropCSM can accurately identify small molecules with potential herbicidal activity as well as characterise their toxicity and safety profiles. • A web server conveniently provides cropCSM's models in a free and easy-to-use interface.

Supplementary data
Supplementary data are available online at https:// academic.oup.com/bib.